Try nuking ShardLayout::V0 #12313

Open
wants to merge 16 commits into base: master

Conversation

Contributor

@eagr eagr commented Oct 25, 2024

No description provided.

@eagr eagr requested a review from a team as a code owner October 25, 2024 09:47
@eagr eagr requested a review from Longarithm October 25, 2024 09:47
@eagr eagr marked this pull request as draft October 25, 2024 09:50
Contributor Author

eagr commented Oct 25, 2024

can you guys do something like cargo test -p near-chain-configs without dependency issues? @wacban

Contributor

wacban commented Oct 25, 2024

> can you guys do something like cargo test -p near-chain-configs without dependency issues? @wacban

It fails for me actually, that's not great. I typically run it on the whole workspace and just filter to the tests that I want. Also we use the nextest framework, rather than test, though I have no clue as to why. It's suboptimal but I never bothered to optimize this part of my workflow.

cargo nextest run <test> 

Contributor

wacban commented Oct 25, 2024

If you feel like fixing it, go for it. It looks like it's only a matter of adding some dependencies to the cargo file.

Contributor

wacban commented Oct 25, 2024

JFYI this PR is marked as draft, please mark it as ready for review when it is.

Contributor Author

eagr commented Oct 26, 2024

> If you feel like fixing it, go for it. It looks like it's only a matter of adding some dependencies to the cargo file.

It seems like this is expected behavior. If it's not bothering anyone else, not sure if it needs fixing. And it could be easily mitigated by adding an --all-features flag to the command.

@eagr eagr force-pushed the deprec-shard-v0 branch 3 times, most recently from 02b02f4 to 7f64b44 Compare October 28, 2024 04:50
@eagr eagr marked this pull request as ready for review October 28, 2024 06:18
}

/// Construct a layout with given number of shards
pub fn of_num_shards(num_shards: NumShards, version: ShardVersion) -> Self {
Contributor Author

got a better idea for the fn name?

Contributor

Maybe multi_shard, just to match the single_shard one?

Contributor Author

That was my first thought but it could also be used to create a single-shard layout, so I changed my mind. But if you like that name I'm also down with it. :)

@@ -1087,7 +1075,7 @@ pub fn create_localnet_configs_from_seeds(
.map(|seed| InMemorySigner::from_seed("node".parse().unwrap(), KeyType::ED25519, seed))
.collect::<Vec<_>>();

- let shard_layout = ShardLayout::v0(num_shards, 0);
+ let shard_layout = ShardLayout::of_num_shards(num_shards, 0);
Contributor Author

This causes some sanity checks to fail, as you can see from the CI logs. It looks like a JSON parsing issue. Not sure whether you'd like to keep it as it was or to update the config somewhere else to make it work.

Contributor

Let's try updating the config and if it doesn't work leave as is.

nearcore/src/config.rs (outdated, resolved thread)
let error_message = format!("{}", error).to_lowercase();
tracing::info!(target: "test", "error message: {}", error_message);
assert!(error_message.contains("shard"));
let _res = env.clients[0].process_chunk_state_witness(witness, witness_size, None, signer);
Contributor Author

There's a panic from get_shard_index() after switching to V2.

Contributor

Ah that's pretty bad. Feel free to either:

  1. Fix it (may be complicated / lots of code if you need to add error handling)
  2. Leave as is but put a TODO(wacban) in there instead of FIXME and I will have a look.
  3. Make the default shard layout V1 (hopefully this works?)

Contributor Author

I'll try 3 (should probably work from the look of the code) which seems like a nice middle ground before finishing transition to V2
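
For reference, option 1 (adding error handling instead of panicking) might look roughly like the sketch below. This is not the nearcore implementation: `ShardId`, `Layout`, and `ShardLayoutError` here are simplified, hypothetical stand-ins, and only the method name `get_shard_index` is taken from the thread.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical stand-in for the real shard id type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ShardId(pub u64);

/// Hypothetical error for an id that the layout does not know about.
#[derive(Debug, PartialEq, Eq)]
pub enum ShardLayoutError {
    InvalidShardId(ShardId),
}

impl fmt::Display for ShardLayoutError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InvalidShardId(id) => write!(f, "invalid shard id {}", id.0),
        }
    }
}

impl std::error::Error for ShardLayoutError {}

/// Simplified layout: only the id -> index map matters for this sketch.
pub struct Layout {
    id_to_index_map: BTreeMap<ShardId, usize>,
}

impl Layout {
    /// Builds a layout whose shard ids are given in index order.
    pub fn new(ids: &[u64]) -> Self {
        let id_to_index_map =
            ids.iter().enumerate().map(|(index, id)| (ShardId(*id), index)).collect();
        Self { id_to_index_map }
    }

    /// Returns an error instead of panicking when the id is unknown.
    pub fn get_shard_index(&self, shard_id: ShardId) -> Result<usize, ShardLayoutError> {
        self.id_to_index_map
            .get(&shard_id)
            .copied()
            .ok_or(ShardLayoutError::InvalidShardId(shard_id))
    }
}
```

With a shape like this, a test could keep asserting on the error message (for example that it mentions "shard") rather than hitting a panic; the cost is that every caller has to handle the `Result`.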

Contributor

@wacban wacban left a comment

looks nice, answered some questions

Comment on lines 197 to 198
// FIXME eagr what should be the default?
#[default(ShardLayout::v0(1, 0))]
Contributor

Ideally it should use the single_shard method that returns the most recent (today it's V2) shard layout.

nit: The convention here seems to be to use the default_ function to provide the default value.
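
A minimal sketch of that suggestion, using only the standard library. The types are hypothetical stand-ins (the real config uses an attribute-based default and has many more fields), and `default_shard_layout` is an assumed name that merely follows the `default_` prefix convention mentioned above.

```rust
/// Hypothetical, heavily simplified stand-in for the real type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ShardLayout {
    V0 { num_shards: u64 },
    V2 { shard_ids: Vec<u64> },
}

impl ShardLayout {
    /// Returns a single-shard layout of the most recent version (V2 in this sketch).
    pub fn single_shard() -> Self {
        ShardLayout::V2 { shard_ids: vec![0] }
    }
}

/// Named `default_*` helper, so the default value lives in one place.
fn default_shard_layout() -> ShardLayout {
    ShardLayout::single_shard()
}

/// Hypothetical config struct with a single field.
#[derive(Debug, PartialEq, Eq)]
pub struct Config {
    pub shard_layout: ShardLayout,
}

impl Default for Config {
    fn default() -> Self {
        Self { shard_layout: default_shard_layout() }
    }
}
```

The point of routing the default through `single_shard()` is that the default follows the latest layout version automatically, instead of pinning a specific version at the field.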

@@ -21,7 +21,6 @@ impl CorruptStateSnapshotCommand {
let mut store_update = store.store_update();
// TODO(resharding) automatically detect the shard version
let shard_layout = match self.shard_layout_version {
- 0 => ShardLayout::v0(1, 0),
Contributor

Can you keep this one?

Contributor

@wacban wacban left a comment

looks good,

I think serde doesn't like (de)serializing maps with non-string keys, like the ones in V2, and it breaks the tests. Feel free to fall back to V1 if it's too crazy to fix in this PR.

Comment on lines 876 to 900
#[test]
fn test_shard_layout_v0() {
let num_shards = 4;
let shard_layout = ShardLayout::v0(num_shards, 0);
let mut shard_id_distribution: HashMap<ShardId, _> =
shard_layout.shard_ids().map(|shard_id| (shard_id.into(), 0)).collect();
let mut rng = StdRng::from_seed([0; 32]);
for _i in 0..1000 {
let s: Vec<u8> = (&mut rng).sample_iter(&Alphanumeric).take(10).collect();
let s = String::from_utf8(s).unwrap();
let account_id = s.to_lowercase().parse().unwrap();
let shard_id = account_id_to_shard_id(&account_id, &shard_layout);
assert!(shard_id < num_shards);
*shard_id_distribution.get_mut(&shard_id).unwrap() += 1;
}
let expected_distribution: HashMap<ShardId, _> = [
(ShardId::new(0), 247),
(ShardId::new(1), 268),
(ShardId::new(2), 233),
(ShardId::new(3), 252),
]
.into_iter()
.collect();
assert_eq!(shard_id_distribution, expected_distribution);
}
Contributor

Please keep this one, the V0 may still be used when replaying some very old blocks.

Contributor Author

eagr commented Oct 29, 2024

> I think serde doesn't like (de)serializing maps with non-string keys, like the ones in V2, and it breaks the tests. Feel free to fall back to V1 if it's too crazy to fix in this PR.

Then I guess it needs a custom de/serializer that converts the keys to strings and back. I'll give it a shot if it's not too complicated.

Contributor

wacban commented Oct 30, 2024

JFYI I had a look at the test failure in CI. It seems like something somewhere has the shard layout version hard coded to 0 where in your PR you (correctly) use the provided version. It's a bit wild, I'll keep digging.

} else {
ShardLayout::v0_single_shard()
};
let shards = ShardLayout::multi_shard(num_shards, 3);
Contributor

To fix the runtime-params-estimator test you can set the version here to 0. It's suboptimal and definitely buggy but I don't think it's worth properly debugging this rather old test framework.

Contributor

Oh my it breaks a bunch of other tests. I guess it will be easier to fix it here after all.

Contributor Author

> To fix the runtime-params-estimator test you can set the version here to 0.

done

codecov bot commented Oct 30, 2024

Codecov Report

Attention: Patch coverage is 85.03937% with 19 lines in your changes missing coverage. Please review.

Project coverage is 71.15%. Comparing base (8e30ccd) to head (1c502af).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| core/primitives/src/shard_layout.rs | 82.52% | 6 Missing and 12 partials ⚠️ |
| chain/chain/src/test_utils.rs | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12313      +/-   ##
==========================================
- Coverage   71.19%   71.15%   -0.05%     
==========================================
  Files         839      839              
  Lines      169743   169831      +88     
  Branches   169743   169831      +88     
==========================================
- Hits       120851   120844       -7     
- Misses      43633    43717      +84     
- Partials     5259     5270      +11     

| Flag | Coverage Δ |
|---|---|
| backward-compatibility | 0.16% <0.00%> (-0.01%) ⬇️ |
| db-migration | 0.16% <0.00%> (-0.01%) ⬇️ |
| genesis-check | 1.27% <48.67%> (+0.04%) ⬆️ |
| integration-tests | 38.99% <49.60%> (-0.01%) ⬇️ |
| linux | 70.58% <85.03%> (-0.07%) ⬇️ |
| linux-nightly | 70.73% <85.03%> (-0.05%) ⬇️ |
| macos | 50.40% <83.46%> (-0.03%) ⬇️ |
| pytests | 1.57% <49.55%> (+0.03%) ⬆️ |
| sanity-checks | 1.38% <48.67%> (+0.04%) ⬆️ |
| unittests | 64.13% <83.46%> (-0.02%) ⬇️ |
| upgradability | 0.21% <0.00%> (-0.01%) ⬇️ |

Contributor Author

eagr commented Oct 30, 2024

The failing test could be fixed by adding some feature flags. But that's from master, does it need to be fixed here?

Contributor

@wacban wacban left a comment

Looks good to me, thank you for this contribution! Just a few final nits. Most are optional, I only really care about restoring the assertion in the resharding test.

shards_split_map: None,
shards_parent_map: None,
version,
})
}

/// Return a V0 Shardlayout
Contributor

Can you mark it as deprecated? I don't know how to do this properly in Rust; if it's not straightforward then a comment should do.

Contributor Author

How about marking ShardLayout::V0 as deprecated? This way any usage of V0 would raise a deprecation warning including calling v0().

Contributor

How about both? :)
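
For context, Rust's built-in `#[deprecated]` attribute can be placed on both an enum variant and a function, which covers the "both" option. The sketch below uses a simplified, hypothetical `ShardLayout`; the note text, the fields, and the `is_v0` and `legacy_layout` helpers are illustrative assumptions, not the code in this PR.

```rust
/// Hypothetical, simplified stand-in for the real enum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ShardLayout {
    /// Any use of this variant triggers a deprecation warning.
    #[deprecated(note = "V0 is kept only for replaying very old blocks")]
    V0 { num_shards: u64, version: u32 },
    V2 { shard_ids: Vec<u64>, version: u32 },
}

impl ShardLayout {
    /// Returns a V0 layout. Deprecated together with the variant.
    #[deprecated(note = "V0 is kept only for replaying very old blocks")]
    #[allow(deprecated)] // the body still has to name the deprecated variant
    pub fn v0(num_shards: u64, version: u32) -> Self {
        ShardLayout::V0 { num_shards, version }
    }

    /// Reports whether this layout is the deprecated V0 variant.
    #[allow(deprecated)]
    pub fn is_v0(&self) -> bool {
        matches!(self, ShardLayout::V0 { .. })
    }
}

/// A legacy call site that must keep using V0 opts out of the warning locally.
#[allow(deprecated)]
pub fn legacy_layout() -> ShardLayout {
    ShardLayout::v0(4, 0)
}
```

Call sites that legitimately still need V0 (such as the tests asked to be kept in this review) can carry a local `#[allow(deprecated)]`, so the warning only shows up for new, unintended uses.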

Comment on lines -151 to -157
// Shard layouts V0 and V1 are rejected.
assert!(ReshardingEventType::from_shard_layout(
&ShardLayout::v0_single_shard(),
block,
prev_block
)
.is_err());
Contributor

Can we keep this?

@@ -2294,7 +2294,7 @@ fn test_protocol_version_switch_with_shard_layout_change() {
epoch_manager.get_epoch_info(&epochs[1]).unwrap().protocol_version(),
new_protocol_version - 1
);
- assert_eq!(epoch_manager.get_shard_layout(&epochs[1]).unwrap(), ShardLayout::v0_single_shard(),);
+ assert_eq!(epoch_manager.get_shard_layout(&epochs[1]).unwrap(), ShardLayout::single_shard(),);
Contributor

mini nit: remove the trailing comma

Comment on lines +220 to +221
let id_to_index_map =
layout.id_to_index_map.iter().map(|(k, v)| (k.to_string(), *v)).collect();
Contributor

This is completely fine but I would be tempted to write a generic function that converts a Map<ShardId, T> to Map<String, T> and use it for all the maps in the shard layout. We're looking at whopping potential savings of ~2 lines of code so up to you if you think it's worth it :)
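
A generic helper along those lines could look like the sketch below. `to_string_keys` is a hypothetical name, and `BTreeMap` with plain integer keys stands in for whatever map and key types the shard layout actually uses; the only requirement on the key is `ToString`.

```rust
use std::collections::BTreeMap;

/// Converts a map keyed by any `ToString` type into a map keyed by `String`.
/// Values are cloned because the input is only borrowed.
pub fn to_string_keys<K, V>(map: &BTreeMap<K, V>) -> BTreeMap<String, V>
where
    K: ToString,
    V: Clone,
{
    map.iter().map(|(k, v)| (k.to_string(), v.clone())).collect()
}
```

Each map in the layout would then be converted with one call, at the cost of a small generic function that needs its own test.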

Comment on lines +252 to +256
let id_to_index_map = layout
.id_to_index_map
.into_iter()
.map(|(k, v)| Ok((k.parse::<u64>()?.into(), v)))
.collect::<Result<_, Self::Error>>()?;
Contributor

Ditto about generic function for this but here it may actually make some sense because it's less trivial logic.
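
The fallible direction, parsing string keys back into typed keys, is where a shared helper saves the most repetition. A minimal sketch, assuming a hypothetical `from_string_keys` name and using `BTreeMap` and `FromStr` from the standard library in place of the real types:

```rust
use std::collections::BTreeMap;
use std::error::Error;
use std::str::FromStr;

type BoxError = Box<dyn Error + Send + Sync>;

/// Parses every `String` key back into `K`, failing on the first key that
/// does not parse. The input map is consumed, so values are moved, not cloned.
pub fn from_string_keys<K, V>(map: BTreeMap<String, V>) -> Result<BTreeMap<K, V>, BoxError>
where
    K: FromStr + Ord,
    K::Err: Error + Send + Sync + 'static,
{
    let mut out = BTreeMap::new();
    for (key, value) in map {
        // `?` boxes the parse error into `BoxError`.
        out.insert(key.parse::<K>()?, value);
    }
    Ok(out)
}
```

An explicit loop is used here instead of `map(...).collect::<Result<_, _>>()` purely for readability; both forms stop at the first invalid key.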

impl TryFrom<SerdeShardLayoutV2> for ShardLayoutV2 {
type Error = Box<dyn std::error::Error + Send + Sync>;

fn try_from(layout: SerdeShardLayoutV2) -> Result<Self, Self::Error> {
Contributor

mini nit: Maybe unpack the layout as a first step and then use the unpacked values directly? It may be a bit prettier and it would be more obvious that there isn't unnecessary cloning.
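
To illustrate the nit: destructuring the input struct up front moves every field out by value, which makes it visible that nothing is cloned. The two structs below are hypothetical two-field stand-ins for `SerdeShardLayoutV2` and `ShardLayoutV2`; only the `id_to_index_map` and `version` field names come from the diff.

```rust
use std::collections::BTreeMap;
use std::convert::TryFrom;

/// Hypothetical serde-facing mirror of the layout, with string keys.
pub struct SerdeLayout {
    pub id_to_index_map: BTreeMap<String, usize>,
    pub version: u32,
}

/// Hypothetical in-memory layout, with numeric keys.
#[derive(Debug, PartialEq, Eq)]
pub struct Layout {
    pub id_to_index_map: BTreeMap<u64, usize>,
    pub version: u32,
}

impl TryFrom<SerdeLayout> for Layout {
    type Error = Box<dyn std::error::Error + Send + Sync>;

    fn try_from(layout: SerdeLayout) -> Result<Self, Self::Error> {
        // Unpack once; every field is moved out of `layout`, so nothing is cloned.
        let SerdeLayout { id_to_index_map, version } = layout;

        let mut parsed = BTreeMap::new();
        for (key, index) in id_to_index_map {
            parsed.insert(key.parse::<u64>()?, index);
        }

        Ok(Self { id_to_index_map: parsed, version })
    }
}
```

A side benefit of the exhaustive pattern is that adding a field to the serde struct later becomes a compile error here until the conversion handles it.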

core/primitives/src/shard_layout.rs (resolved thread)
}
}

impl<'de> serde::Deserialize<'de> for ShardLayoutV2 {
Contributor

nit: I think the convention is to call the lifetime 'a.

}

/// Can be used to construct a multi-shard layout, mostly for test purposes
pub fn multi_shard(num_shards: NumShards, version: ShardVersion) -> Self {
Contributor Author

how about n_shard() in the sense of creating an N-shard layout?

Contributor

I'm not a fan tbh. How about just new or new_test?

Contributor

wacban commented Oct 31, 2024

I tried the pre-merge tests and unfortunately some are failing. Those are the most expensive tests that only run before merging to master. I tried a simple debug but I couldn't fix it easily. I'm afraid we may need to restore kv_runtime to use v0 for now to make it pass. I'm testing this change again on a fork from your PR:
38ca3a1
test run reference for myself:
https://nayduck.nearone.org/#/run/564

Screenshot 2024-10-31 at 11 59 48
